In this notebook:
Key insight:
import sys
import pandas as pd
import os
from pandas_profiling import ProfileReport
import numpy as np
sys.path.append("../utils/")
from preprocessing import extract_text_sections
from preprocessing import get_data
Load, explore, and prepare all required data.
data_path = "../../data/msk-redefining-cancer-treatment"
# Training Data - Text and Genetic Variants Information
training_merge_df = get_data(
text_file_path="raw/training_text", variants_file_path="raw/training_variants"
)
training_size = training_merge_df.shape[0]
print("Number of Training Samples:", training_size)
training_merge_df.head()
# Validation Data - Text and Genetic Variants Information
validation_merge_df = get_data(
text_file_path="raw/test_text",
variants_file_path="raw/test_variants",
solution_file_path="raw/stage1_solution_filtered.csv",
)
validation_size = validation_merge_df.shape[0]
print("Number of Validation Samples:", validation_size)
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
raw_data_df = pd.concat([training_merge_df, validation_merge_df], sort=False)
Class Definitions:
raw_data_df[raw_data_df["Variation"] == "V391I"]
raw_data_df[raw_data_df["Variation"] == "V391I"]["Text"].tolist()[0][
31228 - 35 : 31228 + 43
]
In the text for the CBL V391I genetic variation, we find the passage 'mutations (L399V, G375P, P395A and V391I) which attenuated the CBL E3 activity'. This is consistent with label 4, which indicates a loss of function.
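The slice above relies on a hard-coded character offset. The same passage can also be located by searching for the variant name. The following is a minimal sketch, assuming the variant name appears verbatim in the text; `context_window` is a hypothetical helper (not part of the `preprocessing` module), and the sample string is illustrative rather than a full row from the data.

```python
def context_window(text, term, before=35, after=43):
    """Return the text around the first occurrence of `term`, or None if absent."""
    pos = text.find(term)
    if pos == -1:
        return None
    # Clamp the start so a match near the beginning does not wrap around.
    start = max(pos - before, 0)
    return text[start : pos + after]


# Illustrative string only; in the notebook this would be a row's "Text" value.
sample = (
    "mutations (L399V, G375P, P395A and V391I) "
    "which attenuated the CBL E3 activity"
)
snippet = context_window(sample, "V391I")
print(snippet)
```

Applied to the notebook's data, the first argument would be the `Text` value of the matching row and the second the `Variation` value. Only the first occurrence is returned, so a text that mentions the variant several times would need a loop over all matches.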
ProfileReport(raw_data_df).to_notebook_iframe()
train_processed = pd.read_csv(
os.path.join(data_path, "interim/training_data_additional_features")
)
ProfileReport(train_processed).to_notebook_iframe()